1 Introduction

The purpose of this script is to fit the Cubist and Random Forest models to the training data set and generate predictions on the test data set.

For the training dataset, the response period measures the crime rates for Jan 1, 2019 - Dec 31, 2019. We label this script with the index date 2018-12-31: the observation date on which all predictors are known, immediately before the training response period begins.

The base date in this document is 2019-12-31, the last date before the start of the test period. For the test dataset, the response period measures the crime rates for Jan 1, 2020 - Dec 31, 2020.

Both models will be fitted on the training dataset to predict the response of the test dataset.

grid_cellsize = 490  # Hard-coded

loadRDSModel = TRUE  # Skips Cubist tuning entirely.  If TRUE then tuningFull is not used.

loadRandomForestModel = TRUE  # Skips Random Forest tuning entirely.  If TRUE then tuningFull is not used.

tuningFull = TRUE  # Tune the model over the full calibration grid (TRUE takes a long time, FALSE takes less time).  Only applies if loadRDSModel == FALSE.
all_data_on_grid_training_file = paste0(data_dir, "EXP_ALL_DATA_ON_GRID_490_2018-12-31.geojson")

all_data_on_grid_test_file = paste0(data_dir, "EXP_ALL_DATA_ON_GRID_490_2019-12-31.geojson")

To generate the training and test data sets, we load the grid files and strip out the non-predictor and redundant columns. Since we are predicting crime frequencies, we strip out the other crime outcome columns. Since the models don't require the grid geometries, we also drop the geometry column.
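The column stripping can be sketched as below. This is a minimal illustration, not the script's actual code: `split_grid_data`, the toy column names, and the toy data frame are all hypothetical stand-ins.

```r
# Hypothetical sketch: split a grid data frame into a predictor matrix X and a
# response vector Y by dropping the response and any non-predictor columns.
split_grid_data <- function(grid_df, response_col, drop_cols = character(0)) {
  keep <- setdiff(names(grid_df), c(response_col, drop_cols))
  list(X = grid_df[, keep, drop = FALSE],
       Y = grid_df[[response_col]])
}

# Toy data frame standing in for a loaded grid file (geometry already dropped)
grid_df <- data.frame(id_grid      = 1:3,
                      pop_density  = c(10, 20, 30),
                      dist_to_road = c(1.2, 0.5, 2.0),
                      crime_freq   = c(0, 4, 2))

parts <- split_grid_data(grid_df, response_col = "crime_freq", drop_cols = "id_grid")
names(parts$X)  # only the predictor columns remain
```

In the real script, the same split would be applied to both the training and test grid files so that `X_vars_on_grid_train` and `Y_on_grid_train` line up row by row.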

2 Cubist Model - 2020

We train the Cubist model using caret with committees of 10, 50, and 100 and neighbors of 0, 1, 5, and 9. The selection criterion is the model with the lowest RMSE.

set.seed(1027)

cubistControlFull <- trainControl(method = "cv", selectionFunction = "best")
tuneGridFull <- expand.grid(committees = c(10, 50, 100),
                            neighbors = c(0, 1, 5, 9))

cubistControlLight <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
tuneGridLight <- expand.grid(committees = c(100),
                             neighbors = c(0))

rdsFileName = "cubistTune_train_2018-12-31.rds"

if(loadRDSModel == FALSE ){
  
   if(tuningFull == TRUE)
   {
        cubistControl = cubistControlFull
        cubistGrid = tuneGridFull
   } else  {
        cubistControl = cubistControlLight
        cubistGrid = tuneGridLight
   }

    (cubistTune = caret::train(x = X_vars_on_grid_train,
                               y = Y_on_grid_train,
                               method = "cubist",
                               tuneGrid = cubistGrid,
                               verbose = FALSE,
                               metric = "RMSE",
                               trControl = cubistControl))
  
    saveRDS(cubistTune, file = rdsFileName)
  
} else {
  
   cubistTune = readRDS(rdsFileName)
}

cubistTune
## Cubist 
## 
## 19782 samples
##    48 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17803, 17804, 17804, 17804, 17804, 17804, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    10         0          0.8139778  0.6795867  0.2316203
##    10         1          0.9730717  0.5662784  0.3144288
##    10         5          0.8288617  0.6555496  0.2856498
##    10         9          0.8124881  0.6689108  0.2801826
##    50         0          0.8113386  0.6836329  0.2309959
##    50         1          0.9710231  0.5692205  0.3137215
##    50         5          0.8266854  0.6581639  0.2849059
##    50         9          0.8096557  0.6723146  0.2793732
##   100         0          0.8109074  0.6842842  0.2307630
##   100         1          0.9709695  0.5691123  0.3135567
##   100         5          0.8260157  0.6588417  0.2847187
##   100         9          0.8092202  0.6726947  0.2792815
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 100 and neighbors = 9.

The Cubist dotplots of the splits are omitted: they take a long time to render and are not that useful here.

Next, we consider the prediction on the test set.

Let’s use PAI and AUC to measure the quality of the hotspot predictions.

We can define hotspots as the top 100 grid cells by predicted crime frequency.
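The Predictive Accuracy Index (PAI) compares the share of crime captured by the flagged cells to the share of area they cover. The `pai` helper and its toy inputs below are an illustrative sketch, not the script's actual implementation:

```r
# PAI = (proportion of crime captured in the hotspot cells) /
#       (proportion of the study area covered by those cells)
pai <- function(crime_in_cells, total_crime, area_of_cells, total_area) {
  (crime_in_cells / total_crime) / (area_of_cells / total_area)
}

# Toy example: hotspot cells capture 10 of 100 crimes on 1 of 50 equal-area cells
pai(10, 100, 1, 50)  # 5: the hotspots hold 5x their areal share of crime
```

A PAI of 1 means the hotspots capture no more crime than their area alone would predict; the large PAI values in the tables below reflect how small each grid cell is relative to the city.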

## [1] "Area in miles^2 of Raleigh Buffered at 100 feet: 156.116262020913 miles"
## Loading required namespace: raster

3 Random Forest Model

Now we evaluate the Random Forest model's predictive accuracy.

set.seed(1027)

rfControlFull <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
rfTuneGridFull <- expand.grid(.mtry = seq(2, 20, by = 4),
                              .min.node.size = c(5, 10),
                              .splitrule = "variance")

rfControlLight <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
rfTuneGridLight <- expand.grid(.mtry = c(5, 10),
                               .min.node.size = c(5),
                               .splitrule = "variance")
rfRdsFileName = "randomForestTune_train_2018-12-31.rds"

if(loadRandomForestModel == FALSE ){
  
   if(tuningFull == TRUE)
   {
        print("tuningFull is TRUE")
        rfControl = rfControlFull
        rfGrid    = rfTuneGridFull
   } else  {
        print("tuningFull is FALSE")
        rfControl = rfControlLight
        rfGrid    = rfTuneGridLight
   }
  
    (randomForestTune = caret::train(x = X_vars_on_grid_train,
                                     y = Y_on_grid_train,
                                     method = "ranger",
                                     tuneGrid = rfGrid,
                                     verbose = FALSE,
                                     metric = "RMSE",
                                     importance = "impurity",
                                     num.trees = 1000,
                                     trControl = rfControl))
  
    saveRDS(randomForestTune, file = rfRdsFileName)
  
} else {
  
   randomForestTune = readRDS(rfRdsFileName)
}

The tuning results are shown below. There is not much variation in the MAE or R-squared once mtry is greater than 2.

## Random Forest 
## 
## 19782 samples
##    48 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17803, 17804, 17804, 17804, 17804, 17804, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE       Rsquared   MAE      
##    2     5             0.9485338  0.6589756  0.3607999
##    2    10             0.9511516  0.6566675  0.3614408
##    6     5             0.8215157  0.6817859  0.2959122
##    6    10             0.8253773  0.6790895  0.2969019
##   10     5             0.8095380  0.6820717  0.2914426
##   10    10             0.8098394  0.6827083  0.2916268
##   14     5             0.8075560  0.6802940  0.2909837
##   14    10             0.8046769  0.6832490  0.2908628
##   18     5             0.8033923  0.6821224  0.2902635
##   18    10             0.8037805  0.6822839  0.2907471
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 18, splitrule = variance
##  and min.node.size = 5.

Let’s use PAI and weighted ROC to measure the quality of the hotspot predictions.
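The ROC-style summary can be read as the area under the crime-capture curve: cumulative crime share plotted against cumulative area share as cells are added in ranked order. The `capture_auc` helper below is a hypothetical trapezoidal-rule sketch of that idea, not the script's actual AUC computation:

```r
# Area under the crime-capture curve via the trapezoidal rule.
# prop_area and prop_crime are cumulative shares, sorted ascending,
# starting at 0 and ending at 1.
capture_auc <- function(prop_area, prop_crime) {
  sum(diff(prop_area) * (head(prop_crime, -1) + tail(prop_crime, -1)) / 2)
}

# Toy curve: 10% of the area captures 60% of the crime, 50% captures 90%
capture_auc(c(0, 0.1, 0.5, 1), c(0, 0.6, 0.9, 1))  # 0.805
```

An AUC of 0.5 corresponds to crime spread uniformly over the area; values near 1 mean crime is highly concentrated in the top-ranked cells, which is what the AUC figures in the next section measure for each model.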

Comparison of Random Forest and Cubist Models

| Rank | id_grid (RF) | Cum. Crime Freq (RF) | Prop. Cum. Area (RF) | Prop. Cum. Crime Freq (RF) | PAI (RF) | RRI (RF) | id_grid (Cubist) | Cum. Crime Freq (Cubist) | Prop. Cum. Area (Cubist) | Prop. Cum. Crime Freq (Cubist) | PAI (Cubist) | RRI (Cubist) |
|------|--------------|----------------------|----------------------|----------------------------|----------|----------|------------------|--------------------------|--------------------------|--------------------------------|--------------|--------------|
| 1  | 10922 | 38  | 0.000 | 0.007 | 121.722 | 1.102 | 10922 | 38  | 0.000 | 0.007 | 121.722 | 1.262 |
| 2  | 12189 | 44  | 0.000 | 0.008 | 70.470  | 1.834 | 12189 | 44  | 0.000 | 0.008 | 70.470  | 2.064 |
| 3  | 12401 | 64  | 0.000 | 0.011 | 68.335  | 1.851 | 12401 | 64  | 0.000 | 0.011 | 68.335  | 2.059 |
| 4  | 10923 | 125 | 0.000 | 0.022 | 100.100 | 1.199 | 10923 | 125 | 0.000 | 0.022 | 100.100 | 1.321 |
| 5  | 12649 | 158 | 0.000 | 0.028 | 101.221 | 1.105 | 6464  | 150 | 0.000 | 0.027 | 96.096  | 1.265 |
| 6  | 11101 | 161 | 0.000 | 0.028 | 85.953  | 1.225 | 12649 | 183 | 0.000 | 0.032 | 97.698  | 1.169 |
| 7  | 6464  | 186 | 0.000 | 0.033 | 85.114  | 1.176 | 8551  | 200 | 0.000 | 0.035 | 91.520  | 1.174 |
| 8  | 14140 | 205 | 0.000 | 0.036 | 82.082  | 1.170 | 12613 | 211 | 0.000 | 0.037 | 84.484  | 1.210 |
| 9  | 4349  | 225 | 0.000 | 0.040 | 80.080  | 1.154 | 4349  | 231 | 0.000 | 0.041 | 82.215  | 1.194 |
| 10 | 12613 | 236 | 0.001 | 0.042 | 75.596  | 1.181 | 14140 | 250 | 0.001 | 0.044 | 80.080  | 1.185 |

4 Model Comparison

## [1] "AUC for Cubist Model is: 0.876214710199852"
## [1] "AUC for Random Forest is: 0.888340064380848"
Model PAI Comparison Test 2020

| Rank | PAI (RF) | PAI (Cubist) | Cum. # Assaults (RF) | Cum. # Assaults (Cubist) | Cum. Assault % (RF) | Cum. Assault % (Cubist) | Cum. Area % |
|------|----------|--------------|----------------------|--------------------------|---------------------|-------------------------|-------------|
| 1   | 121.72 | 121.72 | 38   | 38   | 0.67  | 0.67  | 0.01 |
| 5   | 101.22 | 96.10  | 158  | 150  | 2.79  | 2.65  | 0.03 |
| 25  | 53.94  | 52.66  | 421  | 411  | 7.44  | 7.26  | 0.14 |
| 100 | 30.76  | 28.99  | 952  | 898  | 16.82 | 15.87 | 0.55 |
| 200 | 23.50  | 23.44  | 1461 | 1457 | 25.82 | 25.75 | 1.10 |